Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

62 ◾ Bioinformatics

2.2.5 FM-Index

FM-index (Full-text Minute-space index) performs a backward search mechanism on the

top of BWT to find exact pattern matches on a string with a linear computational time

complexity O(N), regardless of the length of the string. Moreover, it could achieve high

compression ratios that allow the indexing of the whole human genome into less than 2.0G

of memory storage [2]. An FN-index searching is performed by using LF mapping matrix.

Searching for a read (a substring) with FM-index uses the backward matching in which the

LF mapping is used repeatedly until matches for the substring are found. FM-index intro-

duces the rank table and lookup table. A rank table is generated from the BWT (column L)

using the character’s rank, which is defined as the number of times a character occurs pre-

viously in the prefix L[1…k] (i.e., lists the occurrence and order of each unique character).

The rank table acts as a function B(c, k), which is explained as follows: if a character c and

a prefix of length k are given, then we can infer the number of times that character occurs

in the prefix. A lookup table lists the index of the first occurrence of each character c from

the first column (column F) of the sorted matrix. The first occurrence of the character c in

the lookup table can be described as C[c].

Figure 2.12 shows Burrows–Wheeler matrix (a), the rank table (b), and the lookup table

(c). Using the rank table and lookup table, the LF mapping can be described as

LF(i) = C[L[i]] + B(L[i], i)

The original string can be recovered using the above rule and backward mechanisms are

as follows:

The last character in the string is “$” on the first row in column F, preceded by “A” on

the first row in column K. So, the last two characters are “A$”. Now we can use the above

rule to find the character that comes before “A$”. Since “A” is on row 1 in column L, we can

locate that character in column F as LF(1) as follows:

LF(1) = C(A] + B(A, 1) = 1 + 2 =2 : The character is on row 2 in column F. So, “A” is pre-

ceded by “G”, which is on row 2 in column L. The last three characters are “GA$”.

LF(2) = C(G] + B(G, 2) = 4 + 1 =5 : The character is on row 5 in column F. So, “G” is pre-

ceded by “G”, which is on row 5 in column L. The last four characters are “GGA$”.

LF(5) = C(G] + B(G, 5) = 4 + 3 =7 : The character is on row 7 in column F. So, “G” is pre-

ceded by “T”, which is on row 7 in column L. The last four characters are “TGGA$”.

LF(7) = C(T] + B(T, 7) = 8 + 1 =9 : The character is on row 9 in column F. So, “T” is pre-

ceded by “C”, which is on row 9 in column L. The last four characters are “CTGGA$”.

We can continue using that rule with the rank table and lookup table until we recover

the entire spring “CTTGGCTGGA$”.

The FM-index can also be used as a pattern searching technique that operates on the

BWT to map read sequences to locations on the BW-transformed reference genome.

Searching for a pattern is a backward process like recovering the original string; a

character is processed at a time, beginning with the last character of the pattern (read

sequence) [11].